Margins and -ml-

The IV-Probit model

Introduction

Previously, I have shown how to use -margins- after -ml-, for the linear regression model (under normality assumption), and for the probit model.

Today, I'll provide yet another example, where the goal is to estimate a two equation model. Namely, the infamous "IVprobit", or instrumental variable probit.

I call this infamous, because it has been subject of multiple threads on Statalist, because of the consistency (or lack there off) regarding estimated marginal effects. you can read some of the latest discussion here.

The bottom line is that through different Stata versions, margins produced different versions of "marginal effects" for this model. All of them correct, but correct from a computational point of view.

Based on my own assessment, corroborated by discussions with colleagues this is the summary of what Stata would do:

Stata 13: Estimated the correct marginal effects for the ivprobit MLE. But not for the two step approach.
Stata 14 & 15: Estimated the Full information marginal effect (I ll explain later what this is). While correct, if considered mechanically, it contradicts common sense. This has been discussed extensibly by Prof Wooldridge. He advocates for a) MLE marginal effects or b) two-step marginal effects.
Stata 16: Due to the recent discussions, an update has been provided and -margins- now produces something equivalent to the two-step marginal effect...but for the MLE estimation. I suspect it will change again too.

So, without further ado, let me present you my own version of how to estimate an ivprobit model, and how the predict program could be set up, and why we had so many types of marginal effects lurking around.

The Setup

First thing first. Unless you already have this program saved somewhere in your accesible ado files (most likely the "ado/personal" folder), make sure to have the following program in memory.

. program adde, eclass
  1.         ereturn `0'
  2. end

So, lets start.

The IVprobit model is a nonlinear model that can be used when your dependent variable is a binary variable (0 - 1), thus you want to estimate the "probability of success (y=1)" given a set of characteristics. $ P(y=1|X) $., but you have some continuous and endogenous explanatory variable.

Econometrics 101. If a variable is endogenous, we cannot estimate the model and interpret the results as causal effects, because changes in the endogenous variable may happen at the same time as changes in unobserved components. Thus, if the outcome changes, we do not know if it is because the endogenous variable changed, or because the unobservables changed. (they are, after all, correlated).

In this case, one option we have to deal with the unobserved confounders is to use instruments to either isolate exogenous variation of the variable of interest (using the 2sls approach for example) or obtain an approximation of the endogenous component that we can control for (Control function approach).

IV-probit is in fact the application of the latter. A control function approach.

Formally, IVprobit model can be written as follows: $$y_2 = z_1 \delta_1 + z_2 \delta_2 + u_2 \quad (1) $$ $$y^*_1 = z_1 \beta_1 + y_2 \beta_2 + u_1 \quad (2) $$ $$y_1= 1(y^*_1>0)$$

where the errors $ (u_1,u_2) $ follow a bivariate normal distribution: $$ \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \sim Normal \begin{pmatrix} \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \sigma_2\\ \rho \sigma_2 & \sigma^2_2 \end{pmatrix} \end{pmatrix} $$ In this model $ z_1 $ is the set of exogenous variables that affect $ y^*_1 $, and $ z_2 $ are the instruments.

It is also easy to see that $ y_2 $ is endogenous, because: $$ if \quad corr(u_1,u_2) \neq 0 \quad and \quad corr(y_2,u_2) \neq 0 \rightarrow corr(y_2,u_1) \neq 0 $$

So how do we estimate this model? by parts!

First, equation (1) can be estimated directly, because it is a function of exogenous variables only. Thus, we could estimate that equation using standard OLS (as ivprobit-two-step does), or via MLE assuming the normality of the errors $ u_2 $.

The one that requires more attention is equation (2). We know that $ corr(y_2,u_1) $ is different from zero, which is the cause of the endogeneity of $ y_2 $. We could, however, "decompose" $ u_1 $, into two parts. One that contains the endogenous component, and one that is exogenous and uncorrelated with all other variables. First, recall that if $ (u_1,u_2) $ follows a bivariate normal distribution, then, conditional on $ u_2 $, $ u_1 $ will have the following distribution: $$ u_1 \sim N \left( \rho \frac{u_2}{\sigma_2}, \sqrt{1-\rho^2} \right) $$ This means that: $$ u_1 = \rho \frac{u_2}{\sigma_2} + v_1 \quad (3) $$ so that $ v_1 $ is uncorrelated with $ y_2 $. We can replace equation (3) into (2) which gives us: $$y^*_1 = z_1 \beta_1 + y_2 \beta_2 + \rho \frac{u_2}{\sigma_2} + v_1 (4) $$ This equation could now be estimated directly, assuming we observe $ u_2 $. However, to be estimated using a probit model, we need to rescale the equation so that the error $v_1 $ has a variance of 1.

This can be done by simply dividing all terms by $ \sqrt(1-\rho^2) $, and estimating the following model using standard probit model. $$\frac{y^*_1}{\sqrt{1-\rho^2})} = z_1 \frac{\beta_1}{\sqrt{1-\rho^2}} + y_2 \frac{\beta_2}{\sqrt{1-\rho^2}} + \frac{\rho}{\sqrt{1-\rho^2}} \frac{u_2}{\sigma_2} + \frac{v_1}{\sqrt{1-\rho^2}} \quad (5) $$ or perhaps the simpler: $$y^{**}_1 = z_1 \beta^r_1 + y_2 \beta^r_2 + \theta \frac{u_2}{\sigma_2} + \nu_1 \quad (6) $$

So what is the difference between using one equation or the other?. I would argue none, as long as you know how to estimate the standard errors from the system.

The actual estimation

So, how do we estimate this ivprobit model? I would say that there are at least three ways of doing so.

The first method is what is typically known as the two-step approach. In the first step, one estimates equation (1) using, say, OLS. obtain the predicted residuals $ \hat{u}_2 $, plug them in equation (6), which can be estimated using a simple probit model. Of course, in doing so, one does not obtain estimates for the structural coefficients, but only the "rescaled" ones.

The main difficulty of using this method is the correct estimation of the standard errors of all coefficients.

Many textbooks would say it is a "simple" application of a delta method, or, use of bootstrap, but the fact of the matter is that you need some way to account that $ \hat{u}_2 $ is a variable that is measured with error when you estimate equation (5).

The second method is to estimate equations (1) and (4) simultaneously using full information maximum likelihood. This imposes the assumption that the errors follow a bivariate normal distribution, and allows you to obtain estimates for the structural parameters in equation (2), in addition to the "linked" parameters $ \sigma_2 $ and $ \rho $.

Under this strategy, the contribution of a single observation to the likelihood function becomes: $$ L_i = L^1_i * L^2_i \quad (7a)$$ $$ L^1_i=\phi \left( y_2,z_1 \delta_1 + z_2 \delta_2, \sigma_2 \right) \quad (7b)$$ $$ L^2_i=\Phi \left( \frac{ z_1 \beta_1 + y_2 \beta_2 + \rho \frac{y_2 -z_1 \delta_1 - z_2 \delta_2}{\sigma_2}} {\sqrt{1-\rho^2}} \right) \quad if \quad y_1 ==1 (7c)$$ $$ L^2_i=1-\Phi \left( \frac{ z_1 \beta_1 + y_2 \beta_2 + \rho \frac{y_2 -z_1 \delta_1 - z_2 \delta_2}{\sigma_2}} {\sqrt{1-\rho^2}} \right) \quad if \quad y_1 ==0 (7d)$$ Notice that instead of "plugging in" $ \hat{u}_2 $ in the second equation, we plug $ y_2 -z_1 \delta_1 - z_2 \delta_2 $. This allows to implicitly account for the measurement errors of the first stage.

There is a third option, which I call "two-step-mle". I call it two-stage because the ivprobit will be estimated using equations (1) and (5). However, I call it MLE, because both equations are estimated simultaneously using MLE: $$ L_i = L^1_i * L^2_i \quad (8a)$$ $$ L^1_i=\phi \left( y_2,z_1 \delta_1 + z_2 \delta_2, \sigma_2 \right) \quad (8b)$$ $$ L^2_i=\Phi \left( z_1 \beta^r_1 + y_2 \beta^r_2 + \theta \frac {y_2 -z_1 \delta_1 - z_2 \delta_2}{\sigma_2} \right) \quad (8c)$$ The main difference with the standard FIML, is that only rescaled coefficients are are estimated, and that the link between both equation is not $ \rho $, but $ \theta $.

However, for this simplified example, both equations identify exactly the same model.

Compared to the usual two-step approach, however, because the model is estimated simultaneously, the standard errors of all coefficients can be correctly estimated, without further calculations (no delta method nor bootstrap).

One last thing to notice in this model. First, there is a close relationship between $ \rho $ and $ \theta $: $$\theta = \frac{\rho}{\sqrt{1-\rho ^2 }} \quad (9a)$$ $$\rho ^2 = \frac{\theta ^2 }{{1+\theta^2 }} \quad (9b)$$ and between the rescaled parameters (we will need this later on): $$\beta_1 = \beta^r_1 * {\sqrt{1-\rho^2}} = \frac{\beta^r_1}{\sqrt{1+\theta^2}} \quad (10)$$

The Log Likelihood function.

While I have shown that using FIML and two-step-ml will provide virtually the same results, I'll stick with the two-step approach, as it allows me to derive marginal effects telling the story of what happened to -margins- through different Stata versions.

The following program defines this the log-likelihood function for the IV probit, using the two-step approach (equation 8), using the following walk-through for the specification:

xb will contain all the exogenous variables $ z_1 $ plus the endogenous variable $ y_2 $.
zb will contain all the exogenous variables $ z_1 $ and instruments $ z_2 $.

. capture program drop myivprobit_2sls

. program myivprobit_2sls
  1.         args lnf xb  theta  zb lnsigma
  2.         qui {
  3.                 local y1 $ML_y1
  4.                 local y2 $ML_y2
  5.                 local u2 (`y2'-`zb')
  6.  
.                 tempvar xb_zb p1 p0
  7.                 gen double `xb_zb'= `xb'+`theta'*((`u2')/exp(`lnsigma')) 
  8.                 gen double `p1'   = normal( `xb_zb')
  9.                 gen double `p0'   = normal(-`xb_zb')
 10.                 tempvar lnf1 lnf2
 11.                 gen double `lnf1'  = log(normalden(`y2', `zb', exp(`lnsigma')))
 12.                 gen double `lnf2' = log(`p1') if `y1'==1
 13.                 replace    `lnf2' = log(`p0') if `y1'==0
 14.                 replace `lnf' = `lnf1' + `lnf2'
 15.         }
 16.  
. end

The predict program

So finally, the part that will be a bit more controversial. The prediction of the probability of success!.

The reason why this is controversial is because there are two candidates have been suggested to identify this expression.

The first candidate relates to the structural equation (2). Basically, if we can estimate the unscaled coefficients, the predicted outcome could be identified by: $$ P(y_1=1| z_1 , y_2) = \Phi \left( z_1 \beta_1 + y_2 \beta_2 \right) $$ or if one prefers the version based on rescaled coefficients: $$ P(y_1=1| z_1 , y_2) = \Phi \left( z_1 \beta^r_1 * {\sqrt{1-\rho^2}} + y_2 \beta^r_2 * {\sqrt{1-\rho^2}} \right) $$ And marginal effects can be obtained by analyzing this equation alone. Standard errors for this expression (at the corresponding marginal effects) can be identified directly only if we estimate the the IVprobit by FIML, or using the rescaled coefficients, making sure standard errors are calculated acounting for the estimation errors of the first stage.

The second option relates to estimate the marginal effects using equation (6): $$ P(y_1=1| z_1 , y_2, \hat{u}_2 ) = \Phi \left( z_1 \beta^r_1 + y_2 \beta^r_2 + \theta \hat{u}_2 \right) $$ However, because $ \hat{u}_2 $ is never really observed, it is usually recommended to average (but not ignore) the impact of $ \hat{u}_2 $ on the equation: $$ P(y_1=1| z_1 , y_2) = E \left( \Phi \left( z_1 \beta^r_1 + y_2 \beta^r_2 + \theta \hat{u}_2 \right)| z_1, y_2 \right) $$ The bottom line: If one uses the two-step approach, marginal effects could be estimated assuming $ \hat{u}_2 $ is just another exogenous variable in the model. The difficulty would be obtaining the correct estimation of standard errors. (specially because they will depend on the measurement errors of $ \hat{u}_2 $.

Now, the FIML option is the most efficient, because standard errors of $ P(y_1=1|.) $ depend on $ \beta_r $'s and $ \rho $. However the two-step procedure depend on $ \beta_r's $, $ \rho $, and $ \hat{u}_2 $.

So Lets write these two options into a "predict" program.

. program myivprobit_p
  1.         syntax newvarname [if] [in] , [ pr1 pr2  *]
  2.         if "`pr1'`pr2'" =="" {
  3.             ml_p `0'
  4.         }
  5.         tokenize `e(depvar)'
  6.         local y1  `1'
  7.         local y2  `2'
  8.         marksample touse, novarlist
  9.         if "`pr1'" !=""  {
 10.             tempvar xb zb theta lnsigma
 11.             _predict double `xb'   , eq(#1)
 12.                 _predict double `theta', eq(#2)
 13.                 _predict double `zb'   , eq(#3) 
 14.                 _predict double `lnsigma', eq(#4)       
 15.                 gen `typlist' `varlist' = normal(`xb'+`theta'*(`y2'-`zb')/exp(`lnsigma')) if `touse'
 16.                 label var `varlist' "P(y=1|X) two-step"
 17.         }       
 18.         else if "`pr2'"!="" {
 19.                 tempvar xb zb theta lnsigma
 20.             _predict double `xb' , eq(#1)
 21.                 _predict double `theta'  , eq(#2)
 22.                 gen `typlist' `varlist' = normal(`xb'/sqrt(1+`theta'^2)) if `touse'
 23.                 label var `varlist' "P(y=1|X) FIML"
 24.         }
 25. end

so, the first option -pr1- will estimate the predicted probability as if the model were estimated using the two step approach, whereas the second will estimate the predicted probability based on the structural equation.

Alright, so lets estimate the model and compare the results with the built-in ivprobit command:

. clear  

. webuse laborsup

. global  y1   fem_work

. global  z1   fem_educ   kids  

. global  y2   other_inc

. global  z2   male_educ   

. 
. *Built in command:
. ivprobit $y1  $z1 ($y2 = $z2), two
Checking reduced-form model...

Two-step probit with endogenous regressors        Number of obs   =        500
                                                  Wald chi2(3)    =      93.97
                                                  Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   other_inc |   -.058473   .0093364    -6.26   0.000    -.0767719    -.040174
    fem_educ |    .227437   .0281628     8.08   0.000     .1722389     .282635
        kids |  -.1961748   .0496323    -3.95   0.000    -.2934522   -.0988973
       _cons |   .3956061   .4982649     0.79   0.427    -.5809752    1.372187
------------------------------------------------------------------------------
Instrumented:  other_inc
Instruments:   fem_educ kids male_educ
------------------------------------------------------------------------------
Wald test of exogeneity: chi2(1) = 6.50                   Prob > chi2 = 0.0108

. *my ivprobit two-step
. ml model lf myivprobit_2sls ($y1 = $z1  $y2 )  (theta:) ($y2 = $z1 $z2  ) (lnsigma:) , ///
>                     technique(nr bhhh)   init(lnsigma:_cons = 2.81 ) maximize nolog

. ml display

                                                Number of obs     =        500
                                                Wald chi2(3)      =      94.01
Log likelihood = -2368.2062                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |
    fem_educ |    .227437   .0281561     8.08   0.000      .172252     .282622
        kids |  -.1961748   .0496179    -3.95   0.000    -.2934242   -.0989254
   other_inc |   -.058473   .0093339    -6.26   0.000    -.0767671   -.0401788
       _cons |   .3956062   .4981099     0.79   0.427    -.5806714    1.371884
-------------+----------------------------------------------------------------
theta        |
       _cons |   .4008085   .1626174     2.46   0.014     .0820843    .7195328
-------------+----------------------------------------------------------------
eq3          |
    fem_educ |   .3351866   .2825972     1.19   0.236    -.2186937     .889067
        kids |   .8329055   .5475666     1.52   0.128    -.2403052    1.906116
   male_educ |   2.845253    .282746    10.06   0.000     2.291081    3.399425
       _cons |   9.872562   5.029193     1.96   0.050      .015524     19.7296
-------------+----------------------------------------------------------------
lnsigma      |
       _cons |   2.813383   .0316228    88.97   0.000     2.751404    2.875363
------------------------------------------------------------------------------

you can see right away that except for differences attributed to rounding errors, and degrees of freedom the results are virtually the same.

It is also reasuring to see that the results are also the same when we compared ivprobit-mle and the rescaled coefficients:

. *FIML
. ivprobit $y1  $z1 ($y2 = $z2), ml

Fitting exogenous probit model

Iteration 0:   log likelihood = -344.63508  
Iteration 1:   log likelihood = -252.10819  
Iteration 2:   log likelihood = -252.04529  
Iteration 3:   log likelihood = -252.04529  

Fitting full model

Iteration 0:   log likelihood = -2368.2142  
Iteration 1:   log likelihood = -2368.2062  
Iteration 2:   log likelihood = -2368.2062  

Probit model with endogenous regressors         Number of obs     =        500
                                                Wald chi2(3)      =     163.88
Log likelihood = -2368.2062                     Prob > chi2       =     0.0000

----------------------------------------------------------------------------------------------
                             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------------------+----------------------------------------------------------------
                   other_inc |  -.0542756   .0060854    -8.92   0.000    -.0662028   -.0423485
                    fem_educ |    .211111   .0268648     7.86   0.000     .1584569    .2637651
                        kids |  -.1820929   .0478267    -3.81   0.000    -.2758315   -.0883542
                       _cons |   .3672086   .4480724     0.82   0.412    -.5109971    1.245414
-----------------------------+----------------------------------------------------------------
 corr(e.other_inc,e.fem_work)|   .3720375   .1300518                      .0946562    .5958136
              sd(e.other_inc)|   16.66621   .5270318                      15.66461    17.73186
----------------------------------------------------------------------------------------------
Instrumented:  other_inc
Instruments:   fem_educ kids male_educ
----------------------------------------------------------------------------------------------
Wald test of exogeneity (corr = 0): chi2(1) = 6.70        Prob > chi2 = 0.0096

. est sto ivp

. *my ivprobit two-step
. ml model lf myivprobit_2sls ($y1 = $z1  $y2 )  (theta:) ($y2 = $z1 $z2  ) (lnsigma:) , ///
>                     technique(nr bhhh)   init(lnsigma:_cons = 2.81 ) maximize nolog

. adde local predict myivprobit_p                 

. est sto myivp                   

. *with rescaled coefficients:
. nlcom   (other_inc: _b[other_inc]/sqrt(1+_b[theta:_cons]^2)) ///
>                 (fem_educ: _b[fem_educ]/sqrt(1+_b[theta:_cons]^2)) ///
>                 (kids: _b[kids]/sqrt(1+_b[theta:_cons]^2)) ///
>                 (cons: _b[_cons]/sqrt(1+_b[theta:_cons]^2)) 

   other_inc:  _b[other_inc]/sqrt(1+_b[theta:_cons]^2)
    fem_educ:  _b[fem_educ]/sqrt(1+_b[theta:_cons]^2)
        kids:  _b[kids]/sqrt(1+_b[theta:_cons]^2)
        cons:  _b[_cons]/sqrt(1+_b[theta:_cons]^2)

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   other_inc |  -.0542756   .0060854    -8.92   0.000    -.0662028   -.0423485
    fem_educ |    .211111   .0268648     7.86   0.000     .1584569    .2637651
        kids |  -.1820929   .0478267    -3.81   0.000    -.2758316   -.0883543
        cons |   .3672086   .4480724     0.82   0.412    -.5109971    1.245414
------------------------------------------------------------------------------

Where the results are exactly the same.

A Story of marginal effects

Let me now walk you through the Story of marginal effects with ivprobit.

Back in Stata 13, marginal effects for IV probit were estimated using the structural equation coeffients: $$ P(y_1=1| z_1 , y_2) = \Phi \left( z_1 \beta_1 + y_2 \beta_2 \right) $$ So that marginal effects were defined as: $$\frac{\partial P(y_1=1|.)}{\partial z_1} = \phi ( z_1 \beta_1 + y_2 \beta_2 ) \beta_1 $$ $$\frac{\partial P(y_1=1|.)}{\partial y_2} = \phi ( z_1 \beta_1 + y_2 \beta_2 ) \beta_2 $$ While I currently do not have access to Stata13, this is what it would produce under version control from Stata 14.2:

. //margins, dydx(*) predict(pr)
. //
. //Average marginal effects                        Number of obs     =        500
. //Model VCE    : OIM
. //
. //Expression   : Prob of positive outcome when rho=0, predict(pr)
. //dy/dx w.r.t. : other_inc fem_educ kids male_educ
. //
. //------------------------------------------------------------------------------
. //             |            Delta-method
. //             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
. //-------------+----------------------------------------------------------------
. //   other_inc |   -.014015   .0009836   -14.25   0.000    -.0159428   -.0120872
. //    fem_educ |   .0545129   .0066007     8.26   0.000     .0415758      .06745
. //        kids |  -.0470199   .0123397    -3.81   0.000    -.0712052   -.0228346
. //   male_educ |          0  (omitted)
. //------------------------------------------------------------------------------

This marginal effects are emulated using pr2 after myivprobit:

. est restore myivp
(results myivp are active now)

. margins, dydx(*) predict(pr2) force
(note: prediction is a function of possibly stochastic quantities other than e(b))

Average marginal effects                        Number of obs     =        500
Model VCE    : OIM

Expression   : P(y=1|X) FIML, predict(pr2)
dy/dx w.r.t. : fem_educ kids other_inc male_educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fem_educ |   .0545129   .0066003     8.26   0.000     .0415766    .0674493
        kids |  -.0470199   .0123394    -3.81   0.000    -.0712047   -.0228351
   other_inc |   -.014015   .0009837   -14.25   0.000    -.0159431   -.0120869
   male_educ |          0  (omitted)
------------------------------------------------------------------------------

You will see that the above command also includes "male_educ" in the list of exogenous variables, because this is an explanatory variable for at least one equation in the model. However, this not being inlcuded in the structural equation, has a no effect on $P(y=1|x) $.

When we reached Stata 14.1. A change was introduced in how probabilities were calculated after ivprobit. As it says in the "whatsnew" material, the new formulation would take into account endogeneity.

Specifically, they use what I call the 2sls predicted probabilities, following equation (5): $$P(y^**1 >0|z_1,y_2,u_2) =\Phi \left( z_1 \frac{\beta_1}{\sqrt{1-\rho^2}} + y_2 \frac{\beta_2}{\sqrt{1-\rho^2}} + \frac{\rho}{\sqrt{1-\rho^2}} \frac{u_2}{\sigma_2} \right ) \quad (11) $$ With the caveat $ u_2 $ was substituted by $ y_2 - z_1 \delta_1 - z_2 $: $$P(y^{**}_1 >0|z_1,y_2)= P(y_1=1|z_1,y_2,u_2) =\Phi \left( z_1 \frac{\beta_1}{\sqrt{1-\rho^2}} + y_2 \frac{\beta_2}{\sqrt{1-\rho^2}} + \frac{\rho}{\sqrt{1-\rho^2}} \frac{y_2 - z_1 \delta_1 - z_2 \delta_2}{\sigma_2} \right ) \quad (12) $$ While the two equations above are basically the same, they have important differences when marginal effects are estimated by software.

So let me explain first what Stata 14.1, and 15 did.

To estimate marginal effects, partial derivatives were based on equation 12: $$ \frac{\partial P(y=1|.)}{\partial z_1} = \phi(.)*\left( \frac{\beta_1}{\sqrt{1-\rho^2}} -\frac{\rho}{\sqrt{1-\rho^2}} * \frac{\delta_1}{\sigma_2} \right) $$ $$ \frac{\partial P(y=1|.)}{\partial z_2} = \phi(.)*\left( 0 -\frac{\rho}{\sqrt{1-\rho^2}} * \frac{\delta_2}{\sigma_2} \right) $$ $$ \frac{\partial P(y=1|.)}{\partial y_2} = \phi(.)*\left( \frac{\beta_2}{\sqrt{1-\rho^2}} +\frac{1}{\sqrt{1-\rho^2}} * \frac{1}{\sigma_2} \right) $$ So, from the technical point of view, this partial derivatives are correct, since they are capturing the direct, and indirect partial effects. A kind of total derivative, rather than partial derivatives.

The problem, however, is that this assumes we could actually observe how the unobserved component $ u_2 $ changes when other variables change. Standard regression analysis, however, would say that these unobserved components should be considered as fixed, and instead one should estimate marginal effects averaging over the unobserved factors. Thus, the second term on each one of the above derivatives should be zero.

Nevertheless, lets replicate this. First, if you are using Stata 16, the following behaivor may no longer be replicable:

. //. margins, dydx(*) predict(pr)
. //
. //Average marginal effects                        Number of obs     =        500
. //Model VCE    : OIM
. //
. //Expression   : Probability of positive outcome, predict(pr)
. //dy/dx w.r.t. : other_inc fem_educ kids male_educ
. //
. //------------------------------------------------------------------------------
. //             |            Delta-method
. //             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
. //-------------+----------------------------------------------------------------
. //   other_inc |  -.0097802   .0014994    -6.52   0.000     -.012719   -.0068414
. //    fem_educ |   .0623273    .007099     8.78   0.000     .0484135     .076241
. //        kids |  -.0614265   .0139446    -4.41   0.000    -.0887574   -.0340956
. //   male_educ |  -.0194406   .0022103    -8.80   0.000    -.0237728   -.0151084
. //------------------------------------------------------------------------------

If I would like to replicate this using myivprobit, I would estimate marginal effects using option "pr1", and requesting derivates to be estimated without the "chain" option. This makes sure that one takes into account the effect of all changes in $ y_2 , z_1 $ on the outcome $ P(y=1|.) $. Not using this option would ignore, for example, the effect of $y_2$ that comes through the error $u_2$. (again, assuming this derivatives are correct).

. est restore myivp
(results myivp are active now)

. margins, dydx(*) predict(pr1) force nochain
(note: prediction is a function of possibly stochastic quantities other than e(b))

Average marginal effects                        Number of obs     =        500
Model VCE    : OIM

Expression   : P(y=1|X) two-step, predict(pr1)
dy/dx w.r.t. : fem_educ kids other_inc male_educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fem_educ |   .0623273   .0060316    10.33   0.000     .0505055     .074149
        kids |  -.0614265   .0128383    -4.78   0.000    -.0865891   -.0362639
   other_inc |  -.0097802   .0009828    -9.95   0.000    -.0117065   -.0078539
   male_educ |  -.0194406   .0074886    -2.60   0.009    -.0341181   -.0047631
------------------------------------------------------------------------------

So what do we see here? First, that the Stata 14, 15, and early 16 versions were estimating a partial effect that inadvertently accounted for a second-order effect (through the first stage regression). This would normally have a small effect on the exogenous variables but will have a noticeable effect on the endogenous variable of interest.

Some people would even report negative marginal effects even when the estimated coefficient was positive.

We also see the odd outcome that the instrument, here male_educ, would also appear in the output, capturing only a second-order effect on the outcome of interest.

Just as a side note, you could also try estimating this without the "nochain" option. If you do that, you will find that the effects for the endogenous variable "improve". This will happen because the second-order effect would not be estimated for the endogenous variable, but only for the exogenous variables. So finally we arrive at the corrected marginal effects for Stata version 16.

After a very interesting discussion in Statalist, where Prof Wooldridge intervenes, it was said that the way Stata (and margins) was estimating marginal effects was incorrect. As I show here, that was the case because partial derivatives were being estimated through the first and second equations.

What Prof. Wooldridge suggested was to take a step back, and estimate marginal effects using a manual two-step approach, obtaining standard errors via bootstrap. This implies using equation (11) to estimate the partial effects.

How does this make a difference?
This makes a difference because we will be making the explicit assumption that $ \hat{u}_2 $ does not change when $ y_2, z_1 \quad or \quad z_2 $ changes. This will modify the partial effects to the following: $$ \frac{\partial P(y=1|.)}{\partial z_1} = \phi(.)*\left( \frac{\beta_1}{\sqrt{1-\rho^2}} \right) $$ $$ \frac{\partial P(y=1|.)}{\partial z_2} = \phi(.)* 0 $$ $$ \frac{\partial P(y=1|.)}{\partial y_2} = \phi(.)* \left( \frac{\beta_2}{\sqrt{1-\rho^2}} \right) $$ The main differences with the "structural" marginal effects are that the evaluation of $ \phi(.) $ includes the values predicted values of the errors; and the coefficients used correspond to the two-step procedure ones (rescaled) To show empirically how this works, we can compare the builtin command, with the "two-step" procedure suggested by Prof. Wooldridge:

. * two step procedure
. reg $y2 $z1 $z2

      Source |       SS           df       MS      Number of obs   =       500
-------------+----------------------------------   F(3, 496)       =     34.36
       Model |  28864.2732         3  9621.42439   Prob > F        =    0.0000
    Residual |  138881.269       496  280.002558   R-squared       =    0.1721
-------------+----------------------------------   Adj R-squared   =    0.1671
       Total |  167745.542       499  336.163411   Root MSE        =    16.733

------------------------------------------------------------------------------
   other_inc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fem_educ |   .3351866   .2837344     1.18   0.238    -.2222829    .8926562
        kids |   .8329056   .5497701     1.52   0.130    -.2472597    1.913071
   male_educ |   2.845253   .2838838    10.02   0.000      2.28749    3.403016
       _cons |   9.872562   5.049432     1.96   0.051    -.0483506    19.79347
------------------------------------------------------------------------------

. predict double u2, resid

. probit $y1 $z1 $y2 u2

Iteration 0:   log likelihood = -344.63508  
Iteration 1:   log likelihood = -252.10819  
Iteration 2:   log likelihood = -252.04529  
Iteration 3:   log likelihood = -252.04529  

Probit regression                               Number of obs     =        500
                                                LR chi2(4)        =     185.18
                                                Prob > chi2       =     0.0000
Log likelihood = -252.04529                     Pseudo R2         =     0.2687

------------------------------------------------------------------------------
    fem_work |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fem_educ |    .227437   .0273171     8.33   0.000     .1738964    .2809775
        kids |  -.1961748   .0478084    -4.10   0.000    -.2898776    -.102472
   other_inc |   -.058473   .0090228    -6.48   0.000    -.0761573   -.0407887
          u2 |   .0240492   .0094295     2.55   0.011     .0055677    .0425306
       _cons |   .3956061   .4784993     0.83   0.408    -.5422352    1.333448
------------------------------------------------------------------------------

. margins, dydx(*) predict(pr)

Average marginal effects                        Number of obs     =        500
Model VCE    : OIM

Expression   : Pr(fem_work), predict(pr)
dy/dx w.r.t. : fem_educ kids other_inc u2

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fem_educ |   .0646175   .0060271    10.72   0.000     .0528045    .0764304
        kids |  -.0557355   .0129302    -4.31   0.000    -.0810782   -.0303929
   other_inc |  -.0166128   .0022499    -7.38   0.000    -.0210225   -.0122032
          u2 |   .0068326    .002632     2.60   0.009     .0016741    .0119912
------------------------------------------------------------------------------

Standard errors would not be corrrect, but bootstrap could be applied to obtain corrected standard errors.

. ** built-in command
. 
. qui:ivprobit $y1  $z1 ($y2 = $z2), ml

. margins, dydx(*) predict(pr)

Average marginal effects                        Number of obs     =        500
Model VCE    : OIM

Expression   : Average structural function probabilities, predict(pr)
dy/dx w.r.t. : other_inc fem_educ kids

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   other_inc |  -.0166128   .0012889   -12.89   0.000     -.019139   -.0140867
    fem_educ |   .0646175   .0073529     8.79   0.000      .050206     .079029
        kids |  -.0557355   .0144233    -3.86   0.000    -.0840047   -.0274664
------------------------------------------------------------------------------

So you can see that the two-step approach and the built-in approach now provide the same marginal effects. And since the official command estimates all coefficients simultaneously, the standard errors can be taken as correct...Let me get back to this point later.

So how can we correct for this with our predict program. Since there is nothing to prevent margins to obtain numerical derivative across both equations, we need to modify slighly how the model is estimated. First, we create clone copies of all variables that enter the second stage: z_1 and y_2, and use them for the model estimation:

. clonevar c_other_inc = other_inc

. clonevar c_fem_educ  = fem_educ

. clonevar c_kids      = kids

. global  y2b c_other_inc

. global  z1b c_fem_educ c_kids 

. 
. ml model lf myivprobit_2sls ($y1 = $z1  $y2 )  (theta:) ($y2b = $z1b $z2  ) (lnsigma:) , ///
>                     technique(nr bhhh)   init(lnsigma:_cons = 2.81 ) maximize nolog

. adde local predict myivprobit_p                 

. est sto myivp

The idea of using "clones" of the exogenous variables $ z_1 $ and endogenous $ y_2 $ is to have access to the same information as the original data, but making sure $ \hat{u}_2 $ does not change when the original data changes.

Marginal effects can be calculated a I did before, except that I now make it explicit to request marginal effects with respect to $z_1 \quad \& \quad y_2 $ only.

. est restore myivp       
(results myivp are active now)

. margins, dydx($z1 $y2) predict(pr1) force 

Average marginal effects                        Number of obs     =        500
Model VCE    : OIM

Expression   : P(y=1|X) two-step, predict(pr1)
dy/dx w.r.t. : fem_educ kids other_inc

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fem_educ |   .0646175    .006331    10.21   0.000     .0522089    .0770261
        kids |  -.0557355   .0134693    -4.14   0.000    -.0821349   -.0293362
   other_inc |  -.0166128   .0023499    -7.07   0.000    -.0212185   -.0120072
------------------------------------------------------------------------------

And done. we have been able to reproduce the second version, two-step, marginal effects for the instrumental variable probit model, that follows the two-step approach advocated by Prof. Wooldridge. Furthermore, this also reproduces what ivprobit currently does!

There is only one last perky detail. If you look at the marginal effect standard errors I produce with the myivprobit command, and compare it with the marginal effects the ivprobit command produces, you will notice they are different.

I'm still uncertain why this may be happening, and I'm currently discussing with people at Stata about the "correct" way to obtain standard errors, but here are the cliff notes:

Recall that IVprobit uses the following formula to estimate marginal effects with respect to $y_2 $ is: $$ \frac{\partial P(y=1|.)}{\partial y_2} = \phi \left(z_1 \frac{\beta_1}{\sqrt{1-\rho^2}} + y_2 \frac{\beta_2}{\sqrt{1-\rho^2}} + \frac{\rho}{\sqrt{1-\rho^2}} \frac{\hat{u}_2}{\sigma_2} \right)* \left( \frac{\beta_2}{\sqrt{1-\rho^2}} \right) $$ My own version of predict uses the same formula, but modified for the two-step approach.
If one needs to estimate standard errors for these partial effects, it is necessary to account for the uncertainty of all estimated coefficients in the model, including the uncertainly of the errors $ \hat{u}_2 $. Unofficial words (still under discussion) indictates that the current formulation assumes $ \rho $ and $ \sigma_2 $ to be constant, when standard errors are calculated.

While this may seem incorrect, I understand the intuition behind this idea.
If you recall the estimation of marginal effects from the structural equation: $$\frac{\partial P(y_1=1|.)}{\partial y_2} = \phi ( z_1 \beta_1 + y_2 \beta_2 ) \beta_2 $$ you will see that this are not affected by $ \rho $ or $ \sigma_2 $. Perhaps this is one of the reasons why the currently estimated marginal effects standard errors are so similar to the ones based on the "old" structural marginal effects.

My own command, however, accounts for the uncertainty in $ \rho $ and $ \sigma_2 $. This also seems correct since two-step marginal procedures are expected to be less efficient than the Fullinformation counterparts.
Of course, if you prefer to have a tie-breaker on which one is correct, I can use a Bootstrap procedure to produce the elusive standard errors.
Basically, I'll use the manual two-step procedure, alomg with a 250 bootstrap repetitions, to report the results:

. program bs_ivprobit, eclass
  1.         reg $y2 $z1 $z2
  2.         capture drop u2
  3.         predict double u2, resid
  4.         probit $y1 $z1 $y2 u2
  5.         margins, dydx(*) predict(pr) nose post
  6. end

. bootstrap , reps(250) seed(1) nodots:bs_ivprobit

Average marginal effects                        Number of obs     =        500
                                                Replications      =        250

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fem_educ |   .0646175   .0058983    10.96   0.000      .053057     .076178
        kids |  -.0557355   .0138358    -4.03   0.000    -.0828532   -.0286179
   other_inc |  -.0166128   .0023778    -6.99   0.000    -.0212732   -.0119525
          u2 |   .0068326   .0027569     2.48   0.013     .0014292    .0122361
------------------------------------------------------------------------------

In this case, it seems that the bootstrap estimates seem to favor my version of marginal effects and standard errors. However, you are free to take this as limited evidence, and use either strategy as your own.

Conclusion

The command -ml- is a powerful tool that can be used to estimate single or multiple equation models, as long as the loglikelihood functions (and their inter-relations) can be properly defined.

-margins- is also a very flexible command that can be easily combined with -ml- to expand the estimation of marginal effects for properly defined outcomes. While the command is flexible and relatively easy to use, these properties can also be double+edge swords, if one is not aware of the mechanics behind the actual estimation of partial effects.

In my view, the original estimation of marginal effects after iv-probit was correct, but the changes it received in Stata 14.1 introduced what we could call a bug, that was based on solid Math. However, unless you dig deeper into what ivprobit tries to estimate, it would be difficult to say why that change produced undesirable results.

While the last update on Stata 16 has made a correction that now produces correct partial effects (two-step like), I still believe there is a remaining bug in the program that may be affecting the estimation of standard errors. And of course, margins could also go back to the beginning, and provide users the option for the estimation of MLE-structural marginal effects.

Who knows, this may be going to the Statalist wishlist...

Epilog: one last change

Today's development! As of April 1st, 2021, the way of how IVprobit works has changed for the estimation of standard errors!

If you happen to read my post before, (or what you can see above), when I wrote it i made the case that Standard errors for IVprobit were not being estimated correctly, because they didn't account for errors on some of the structural parameters (rho and sigma). However, Today (April 1, 2021), Stata has pushed an update fixing this!. On the one side, the code above will no longer be reproducible. On the brighter side, it seems my hunch was right!

And now, standard errors for IVprobit are estimated the right way! The circle is complete!

So, just for closure, the example today:

. webuse laborsup
. global  y1   fem_work
. global  z1   fem_educ   kids  
. global  y2   other_inc
. global  z2   male_educ   
. qui: ivprobit $y1  $z1 ($y2 = $z2), ml
. margins, dydx(*) predict(pr)
Average marginal effects                        Number of obs     =        500
Model VCE    : OIM

Expression   : Average structural function probabilities, predict(pr)
dy/dx w.r.t. : other_inc fem_educ kids

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   other_inc |  -.0166128   .0023501    -7.07   0.000    -.0212189   -.0120068
    fem_educ |   .0646175   .0063309    10.21   0.000     .0522091    .0770258
        kids |  -.0557355   .0134692    -4.14   0.000    -.0821347   -.0293364
------------------------------------------------------------------------------

So no more Statalist wish! The change is done, and here to stay! Unless someone else finds something that needs to be addressed...